A PARALLEL-DESIGN DISTRIBUTED-IMPLEMENTATION (PDDI)
GENERAL-PURPOSE COMPUTER

Uzi  Vishkin 
Courant Institute, New York University

ABSTRACT 

A scheme of an efficient general-purpose parallel computer is
introduced. Its design space (i.e., the model for which parallel
programs are written) is a permissive parallel RAM model of
computation. The implementation space is presented as a scheme of a
synchronous distributed machine which is not more involved than a
sorting network followed by a merging network. An efficient
translation from the design space into the implementation space is
given. Suppose that for some t and x there is a parallel algorithm in
the design space which has depth (i.e., parallel time) O(t/p) using p
processors, for all p ≤ x. This translates to an algorithm in the
implementation space with depth O(t/s) for all s ≤ t/ℓ, where ℓ
depends on the choice of the sorting and merging networks and s is the
number of "powerful" processors used (processors not in the sorting or
merging networks). The translation also uses f(s,m) auxiliary
processors, where m is the size of the common memory in the design
space. For a specific choice, ℓ = log^2 s + log m and
f(s,m) = O(s log^2 s + m log m), comparing favorably with alternative
known solutions. Since many parallel algorithms are designed for a wide
range of numbers of processors, our solution pays the fine for
implementation where it hurts least.

I. INTRODUCTION


This paper is motivated by the fact that the tremendous potential
power of microstructure technology can be realized only if we find
effective parallel architectures and algorithms for utilizing large
numbers of small but powerful processors. On one hand, synchronous
shared memory models of parallel computation have been shown to be a
very effective framework for designing algorithms for many problems.
On the other hand, physical limitations of currently available
technologies suggest one, but only one, basic constraint: in a machine
built as an assemblage of a large number of processing elements, each
processor can be connected only to a fixed number of other processors,
and this in a fixed pattern.

In order to support the claim that such models of parallel computation
are effective we mention a few salient algorithms that can be
implemented in them. (Most of these algorithms were designed for such
models.) Finding the maximum among n elements ([SV-81]). Merging
([BH-82] and [SV-81]). Sorting ([AKS-83], [BH-82], [Hi-78], [Pr-78],
[SV-81] and more). Computing convex hulls in two dimensions
([NMB-81]). Computing connected components of undirected graphs
([CLC-82], [HCS-79], [SV-82a], [Vi-81b] and [Ky-79]). Computing
biconnected components of an undirected graph ([TV-83]). Algorithms on
trees ([Me-81] and [TV-83]). Data structures ([PVW-83]). Finding
max-flow in a network ([SV-82b]). Numerous numerical algorithms (for a
survey see [He-78]).

We  suggest   a   solution   for   the   following   problem. 

Problem:      Design     an      efficient      general-purpose   parallel   computer 
that   satisfies    three    requirements: 

(1) The design space (i.e., the model of computation for which programs
are written) is a permissive synchronous shared memory model of
parallel computation. In particular, we take the Fetch-and-* Parallel
RAM (F&* PRAM).

The F&* PRAM is slightly more permissive than the concurrent-read
concurrent-write parallel RAM (CRCW PRAM). See Stockmeyer and
Vishkin [SV-82c] for a formal definition of the CRCW PRAM. The CRCW
PRAM consists of a sequence P_1, P_2, ..., P_p of RAM's operating
synchronously in parallel. Each individual RAM is similar to a
standard uniprocessor model as defined in [AHU-74], Chap. 1. In
particular, each RAM is assumed to have its own local random-access
memory and has instructions for typical arithmetic and boolean
operations and for reading from and writing into its local memory.
The RAM's also have access to a shared memory of size m, and each
RAM has instructions for reading from and writing into the common
memory using one of its private registers to specify the common
memory address. Several processors may read simultaneously from the
same memory location. If more than one processor attempts to write
into the same location in common memory at the same time, the lowest
numbered processor succeeds.

Let us go back to the F&* PRAM. Let A be a common memory address,
e_i be a local register of processor P_i and * be an associative and
commutative operation. Define the Fetch-and-* (F&*) instruction as
follows. (It is similar to [GGKMRS-83].) If processor P_i performs an
F&*(A,e_i) and no other processor performs at the same time an
instruction that relates to address A, then a local register of P_i is
assigned with the content of A and A is assigned with A*e_i. Suppose
that several processors simultaneously perform F&* instructions (for
the same * operation) that relate to A. The result is defined to be as
if they performed these instructions serially in some order.

(Remark. We assume that no processor is seeking access to address A
with another type of instruction or with an F&* instruction for
another * operation. If this happens the algorithm is considered
illegal. Alternatively, some default results can be imagined.) The
F&* PRAM is a CRCW PRAM that allows these F&* instructions for some
set of * operations. Each instruction takes one time unit (uniform
cost criterion). Both the program and the input are located in the
common memory.

(2) The implementation space (i.e., the model of computation in which
the machine is specified) is a synchronous distributed model of
parallel computation, where each processor is connected in a fixed
pattern to a small number of others.

(3)  There  is  an  efficient  automatic  procedure  that  translates  every 
algorithm  for  the  design  space  into  the  implementation  space. 

This presentation of the problem explains why we call our solution
a parallel-design distributed-implementation (PDDI) computer.
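The Fetch-and-* semantics of requirement (1) can be illustrated by a minimal sketch. It is not part of the machine. The function name fetch_and_op, the dictionary that stands for the common memory, and the use of list order as the serial order are assumptions made here for illustration only; any other serial order would be an equally legal outcome under the definition.

```python
from operator import add


def fetch_and_op(memory, address, requests, op=add):
    """Serve simultaneous F&* requests aimed at one common memory address.

    memory   -- dict mapping addresses to contents (stands for common memory)
    requests -- list of (processor_id, e) pairs, all using the same operation
    op       -- the * operation; assumed associative and commutative

    The requests are served serially in list order (an arbitrary choice;
    the definition allows any order). Returns {processor_id: fetched value}.
    """
    fetched = {}
    for processor_id, e in requests:
        fetched[processor_id] = memory[address]       # old content goes to the processor
        memory[address] = op(memory[address], e)      # A is assigned with A * e
    return fetched


memory = {7: 10}
fetched = fetch_and_op(memory, 7, [(1, 5), (2, 3), (3, 1)])
print(fetched, memory[7])
```

With * taken to be addition, each processor receives a distinct partial sum, while the final content of the address does not depend on the serial order chosen.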

This problem lies at the heart of the theory of parallel
computation. An efficient solution of the problem also plays a central
role in the theory of distributed computation, since it leads to a
utilization of distributed machines while visualizing a mathematically
appealing and effective design space. In general, it seems unlikely
that programmers will be able to write efficient algorithms directly
for fixed pattern distributed machines, even if the fixed pattern
changes from one algorithm to the other as implied by the
general-purpose distributed computer of Galil and Paul [GP-83]. (The
term "distributed" in the present paper corresponds to "parallel" in
[GP-83].) This explains why our problem is more general than theirs:
their design space is a synchronous distributed model of computation.
Lev, Pippenger and Valiant [LPV-81] implies that there exists a simple
translation of a program in their design space into our design space in
constant time using the same order of the number of processors. Thus,
our simulation can be utilized to solve the problem of simulating every
special-purpose synchronous distributed machine on our PDDI machine. A
simulation that solves this problem is the main contribution of
[GP-83]. The worst-case time analysis of their solution is the same as
ours (without the improvement due to the efficient version of Section
6) for comparable cases. However, our solution allows more general
patterns of communication for the design space and, therefore, equips
the designer with more powerful design tools. For example, information
which is known to one processor only (it appears in its local memory)
may become known to any subset of the processors in constant time by
utilizing both the common memory and simultaneous reads from the same
common memory location. By contrast, a time lower bound of the order of
the logarithm of the number of processors can be readily established
for instances of this problem in a synchronous distributed model where
the degree of each node is bounded by a constant, due to fan-in
considerations.

Our solution compares favorably with the 'naive' solution for the
main problem. By the naive solution we mean the following. There are
(1) p (balanced) binary trees each having m leaves, called
'processor-trees', and (2) m (balanced) binary trees each having p
leaves, called 'memory-trees'; each of the leaves of a processor-tree is
shared with a leaf of a distinct memory-tree. Each processor- (resp.
memory-) tree corresponds to one of the p processors (resp. m common
memory locations) of the F&* PRAM. The communication between a
processor and a common memory location is simulated in the obvious way
via their shared leaf. Extending it for a translation of F&* PRAM
algorithms by this synchronous distributed machine is straightforward.
The naive solution multiplies time requirements by O(log m + log p) and
processor requirements by O(m). Use of pipelining in a way similar to
our solution can further improve this solution. The main disadvantage
of this approach is the relatively large number (O(pm)) of "auxiliary"
processors required. This inefficiency is due to the fact that each
leaf is dedicated to simulating communication between a certain
processor and a certain common memory location regardless of the need
for such communication in the time unit being simulated. Our solution
provides for a dynamic assignment of auxiliary processors for this
purpose, thereby substantially reducing the number of auxiliary
processors. Applications of the naive technique can be found in
Eckstein [Ec-79a], Lev [Lev-80] (for related simulation problems) and
Thompson [Th-82] (for sorting). This technique is sometimes called
"Orthogonal Trees".
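The gap between the O(pm) auxiliary processors of the naive solution and a sorting-plus-merging scheme can be made concrete with a small count. This is only a sketch under stated assumptions: every internal tree node is assumed to have exactly two children (so a tree with k leaves has 2k - 1 nodes), the function names are hypothetical, and the second function evaluates the abstract's bound, read here as s log^2 s + m log m, with all constant factors dropped. It therefore compares orders of growth, not actual machine sizes.

```python
def orthogonal_trees_nodes(p, m):
    """Nodes in the naive scheme: p processor-trees with m leaves each and
    m memory-trees with p leaves each, every leaf shared by one tree of
    each kind. Assumes a tree with k leaves has 2k - 1 nodes."""
    processor_tree_nodes = p * (2 * m - 1)
    memory_tree_nodes = m * (2 * p - 1)
    shared_leaves = p * m              # each leaf was counted twice above
    return processor_tree_nodes + memory_tree_nodes - shared_leaves


def log2_exact(n):
    """Base-2 logarithm of a power of two, as an integer."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    return n.bit_length() - 1


def sort_merge_growth(s, m):
    """The expression s log^2 s + m log m with all constants dropped."""
    return s * log2_exact(s) ** 2 + m * log2_exact(m)


for size in (16, 256, 1024):
    print(size, orthogonal_trees_nodes(size, size), sort_merge_growth(size, size))
```

The first count equals 3pm - p - m, so it grows with the product pm, whereas the second grows only slightly faster than linearly in s and m.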


Simulations of tightly coupled parallel computation models by a
distributed model of computation are also studied in a few other
papers. Each of these works either solves a different problem from ours
or does not provide a worst-case efficient solution. Lev, Pippenger,
and Valiant [LPV-81] mention simulations of the exclusive-read
exclusive-write (EREW) PRAM model, where concurrent access of more than
one processor to the same common memory location is forbidden. Borodin
and Hopcroft [BH-82] outline another solution for our problem for the
case p = m. We refer to their simulation later in the paper. Vishkin
[Vi-81a] presents a solution for an easier problem: the implementation
space is an EREW PRAM and not a distributed machine.

The comprehensive paper [Sc-80] describes the "Paracomputer", a
model of parallel computation very similar to our CRCW PRAM, and
proposes the former as a model suitable for studying theoretical
aspects of parallel computation. Various Paracomputer algorithms are
implemented in the "Ultracomputer" (a perfect shuffle interconnection
machine). The paper of Gottlieb et al. [GGKMRS-83] suggests replacing
the CRCW-PRAM-Paracomputer by a Fetch-and-Add-PRAM-Paracomputer and the
Perfect-Shuffle-Ultracomputer by another interconnection network. The
automatic procedure suggested there for the simulation of the
Paracomputer by the Ultracomputer is claimed to satisfy a good
average-case criterion. No claims are made regarding worst-case
criteria that this simulation satisfies. Note finally that the
subsection on "alignment networks" by Kuck [Ku-77] contains a survey of
known interconnection networks for processors and memories.

The general design of the machine is given in the next section.
It is followed by a few details that prepare the reader for the
simulation. Section 3 gives an important part of the simulation. Its
correctness is proven in Section 4. Other parts of the simulation are
discussed in Section 5. An efficient version of our simulation that
utilizes pipelining and gives our result an edge over previously
suggested simulations appears in Section 6. Section 7 includes a few
concluding remarks.

II. Preliminaries.

In outlining the solution, there was an effort to describe it in
the most general form, leaving as much freedom to the reader as possible
for filling in details that might have a few alternatives. More
specifically, an implementation scheme is given that reduces the
simulation problem to the problems of designing networks for sorting
and merging. These problems are probably among the first to be
considered in any new technology. [Th-82] describes, for example,
thirteen (!) algorithms for sorting in a model of VLSI. Let us
describe the machine.

The synchronous distributed computer (SDC): The machine has a
sequence of RAM's S_1, S_2, ..., S_p to be called super-processors. Each
individual super-processor has many properties in common with the
processors of the F&* PRAM. Actually, the description of the
processors of the parallel computation model, up to the point where we
start discussing their access to the common memory (which does not exist
here), applies to the super-processors as well. Our model also employs
two families of fairly degenerate processors: 1. A sequence of
processors R_1, R_2, ..., R_r, called comparator-processors; and 2. A
third sequence of processors M_1, M_2, ..., M_m, called
memory-processors.

Each of the comparator-processors has instructions for checking
the predicates = and <. Each of the comparator or memory processors
has a small local memory; it can read from and write into its local
memory; it can perform only the * operations that we wish to include in
the F&* instruction of our F&* PRAM design space. It has a program
which is independent of the algorithm being implemented. This program
is located in its local memory.

All processors operate synchronously in parallel. They can be
thought of as nodes (vertices) that are connected by lines (edges)
forming a graph of communication. Figure 1 illustrates the general
structure of the SDC graph of communication; super-processors are
represented by circles and memory-processors by triangles.
Comparator-processors only fill the area referred to as sorting and
merging networks. Each processor has an additional instruction that
serves as the main communication tool. The information to be
communicated is loaded into a communication register which corresponds
to the adjacent line on which this information is to be transmitted.
The processor on the other side of the line may read this register.
The degree of each vertex of our graph of communication does not exceed
4. Every instruction takes one time unit (uniform cost criterion).

Note that we do not specify the sorting and merging networks. Our
results  and  analysis  hold  for  any  such  networks.  For  instance,  we  may 
use  Batcher's  networks. 
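For readers who want something concrete to hold on to, the following sketch generates one well-known sorting network, the merge exchange formulation of Batcher's method, as a list of comparators and checks it by the zero-one principle. It is an illustration only: the function names are hypothetical, and nothing in the simulation depends on this particular network.

```python
from itertools import product


def merge_exchange_network(n):
    """Comparators (i, j), i < j, of Batcher's merge exchange sort on n wires."""
    comparators = []
    if n < 2:
        return comparators
    t = (n - 1).bit_length()               # ceil(log2 n)
    p = 1 << (t - 1)
    while p > 0:
        q, r, d = 1 << (t - 1), 0, p
        while d > 0:
            for i in range(n - d):
                if (i & p) == r:
                    comparators.append((i, i + d))
            d, q, r = q - p, q // 2, p     # all three use the old value of q
        p //= 2
    return comparators


def apply_network(values, comparators):
    """Run the compare-exchange modules in order; the smaller value goes to wire i."""
    v = list(values)
    for i, j in comparators:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v


network = merge_exchange_network(8)
# Zero-one principle: a comparator network that sorts every 0/1 input sorts every input.
all_sorted = all(apply_network(bits, network) == sorted(bits)
                 for bits in product((0, 1), repeat=8))
print(len(network), all_sorted)
```

In the machine each comparator would be one comparator-processor; the list order above only fixes a legal order of evaluation, and comparators on disjoint wires can operate in the same time unit.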

Our goal is to implement the F&* PRAM on the SDC.
We define a 1-1 correspondence between the p processors of the F&*
PRAM model and the p super-processors of the SDC. Each super-processor
S_i is 'responsible' for simulating the behavior of its corresponding
P_i.

Another 1-1 correspondence is defined between the m addresses of
the common memory and the memory-processors. Each memory-processor M_i
is responsible for storing and simulating the updates of the content of
(common) memory address i. The simulation of access of processors to
the common memory is done by the super-processors and the
memory-processors through a network of nodes and lines. All the nodes
in this network are comparator-processors.

To distinguish between time units of the algorithm and its
implementation we refer to the former as pulses. We assume that the
pulses can be partitioned into three sets: reading pulses, F&* pulses,
and writing pulses.

Remark. The variant of simultaneous access to the same (common) memory
location for a mixed objective (e.g., two or more of reading, F&* and
writing) can be avoided without changing the order of magnitude of the
running time of the algorithm. Break each time unit into three. In the
first third the reading is performed, in the second the F&* and in the
third the writing.

Remark. Let Seq(n) be the lowest worst-case upper bound on the running
time of a sequential algorithm for a certain problem of input size n.
Obviously, the best upper bound on the parallel running time
achievable, without improving the sequential result, for an algorithm
using p processors in the F&* PRAM is of the form O(Seq(n)/p). An
algorithm that achieves this running time is said to have "optimal
speed-up", or more simply to be "optimal". Upper bounds on the
worst-case resource requirements of algorithms which are designed for
parallel PRAM's are usually presented as: Depth O(y) for z processors
and m common memory locations (y, z, m may be functions of the input
parameters). An equivalent formulation of such a result is: Depth
O((y*z)/p) for all p ≤ z processors and m common memory locations, for
the same y, z and m. We use mostly the second formulation.
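The passage from the first formulation to the second rests on a standard argument, sketched below under an explicit assumption: each of the p available processors replays, in round-robin fashion, at most ceil(z/p) of the z processors for which the algorithm was designed, and the bookkeeping needed to keep simultaneous accesses consistent is ignored. The function name is hypothetical.

```python
def rescaled_depth(y, z, p):
    """Steps needed when p processors (1 <= p <= z) replay an algorithm of
    depth y designed for z processors, each simulating ceil(z/p) of them."""
    if not 1 <= p <= z:
        raise ValueError("need 1 <= p <= z")
    rounds_per_step = -(-z // p)           # ceil(z / p) in integer arithmetic
    return y * rounds_per_step


for p in (1, 8, 64):
    print(p, rescaled_depth(10, 64, p))
```

Since ceil(z/p) < z/p + 1 and p ≤ z, the result is below 2(y*z)/p, which is the O((y*z)/p) bound of the second formulation.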
III. Simulation of a reading pulse.

Say that processor P_i, 1 ≤ i ≤ p, wishes to read from address a_i
of the common memory. Then S_i (the corresponding super-processor)
communicates this by writing into its communication register. In case
P_i does not wish to read at the pulse, a_i is assigned fictitiously a
nonexistent address. (This nonexistent address is the default content
of P_i's communication register.) So, if P_i wishes to perform any
instruction other than having access to the common memory then S_i (the
corresponding super-processor) can do it in one time unit.

The  simulation  of  reading  from  the  common  memory  involves  four 
steps. 

Step 1. Apply a sorting network (e.g., Batcher's [Ba-68] network)
to sort the pairs (a_1,1), (a_2,2), ..., (a_p,p) in the lexicographic
order. Namely, (a_i,i) < (a_j,j) iff a_i < a_j, or a_i = a_j and i < j.
Denote the output of the sorting procedure by (a_{j_1},j_1),
(a_{j_2},j_2), ..., (a_{j_p},j_p).
Processors that wish to read from the same address are represented
in this output consecutively, sorted according to their original
serial numbers. As we mentioned earlier, comparator-processors take the
place of comparator modules in the sorting network (and in the merging
network of the next step). In addition to their functioning as such,
they keep for each input the line on which it arrived. It is used in
Step 4 for returning each (a_i,i) along the same path it arrived on.

Remark. A similar application of sorting appears also in [BH-82] and
[Vi-81a]. However, the contribution of this paper is in the next step,
where we show how to use merging networks for the purpose of fanning
out the contents of memory cells to processors that seek to read them.
The use of merging networks enables us to use pipelining (see Section
6) in the efficient version of our simulation, which improves the time
complexity of [BH-82]. Also, the number of auxiliary processors seems
to be smaller than theirs. It should be noted that they do not specify
the number of processors. The above applies for comparable cases only,
since [BH-82] presented their solution only for the case p = m. The
merging networks are used in a nonstandard way since we are interested
in the intermediate computations of the network, rather than in the
merging itself. In a companion paper [Vi-83a] we coin the name
'lucid-box composition technique' (as opposed to black-box
compositions) for this wider notion of using effectively intermediate
results of existing (or supposedly existing) procedures for the purpose
of designing new ones. That paper includes more examples where this
technique is useful.

Step 2. Apply a merging network (e.g., Batcher's network) to merge
the output of the sorting network (a_{j_1},j_1), (a_{j_2},j_2), ...,
(a_{j_p},j_p) and the (sorted) list of addresses of the common memory,
denoted (w.l.o.g.) by 1,2,...,m. For the merging, we define
(a_i,i) < j iff a_i ≤ j, so that in the merged order the pairs seeking
address j immediately precede j itself. The comparator-processors of
the merging network keep for each input pair (a_i,i) the line on which
it arrived.

To each memory address i we attach its content c_i that moves
together with i in the merging network. To each pair (a_j,j) we
attach a variable d_j. Upon entering the merging network at the
beginning of Step 2, d_j is 'undefined' for each j, 1 ≤ j ≤ p.
Whenever a memory address j meets a pair (a_i,i), such that a_i = j,
in a comparator-processor of the network we copy the content of this
address into d_i. Whenever two pairs (a_i,i) and (a_j,j) such that
a_i = a_j meet at a comparator-processor and one of them, say (a_i,i),
has already found its d_i in a previous time unit (d_i is not
'undefined'), we copy this value into d_j.
Figure 2 illustrates Step 2.


Actually, we eliminate the output lines of the merging network,
since we are not interested (as explained later) in the merging itself;
so the last layer of the Batcher merging network, for instance, will be
the last 'layer' of our implementation network. The
comparator-processors of this last layer, together with the first and
last comparator-processors of the first layer of Batcher's merging
network, are referred to as terminals, as there are output lines of the
network that emanate from them. It should be clear how to generalize
the definition of terminals to general merging networks. In the next
section we prove that upon finishing Step 2, when [(a_j,j), d_j] is
transmitted to a corresponding terminal comparator-processor,
d_j = c_{a_j}; namely, d_j equals the content of memory address a_j,
which is exactly what super-processor S_j wishes to read.

Step 3. Each [(a_i,i), d_i] is returned through the merging network,
along the paths it traversed during Step 2, to its 'output
processor' of the sorting network.

Step 4. It is further returned to its S_i.
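The four steps above can be traced on a small instance with the following sketch. It is a functional model only, built on assumptions that go beyond the text: p = m is a power of two, Python's sorted() stands in for the sorting network of Step 1, Batcher's odd-even merging network is used for Step 2, None plays the role of 'undefined' (so memory contents are assumed not to be None), a pair (a_i,i) is taken to precede memory address j exactly when a_i ≤ j (the tie-breaking that matches the output order listed in the correctness proof), and the return trips of Steps 3 and 4 are modeled by indexing on the processor number instead of retracing recorded lines. All function names are hypothetical.

```python
def oddeven_merge(lo, hi, r=1):
    """Comparators (i, j) of Batcher's odd-even merging network on wires
    lo..hi (inclusive). hi - lo + 1 must be a power of two, and the two
    halves of the input must each arrive sorted."""
    step = 2 * r
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)
        yield from oddeven_merge(lo + r, hi, step)
        for i in range(lo + r, hi - r, step):
            yield (i, i + r)
    else:
        yield (lo, lo + r)


def read_pulse(addresses, memory):
    """addresses[i-1] is a_i, the address processor i wants to read (any
    value outside 1..m means 'no read'); memory[j-1] is c_j.
    Returns [d_1, ..., d_p], with None standing for 'undefined'."""
    p, m = len(addresses), len(memory)
    assert p == m and p > 0 and p & (p - 1) == 0, "sketch needs p = m = 2^k"

    # Step 1: sort the pairs (a_i, i) lexicographically.
    pairs = sorted((a, i) for i, a in enumerate(addresses, start=1))

    # A message is [key, address, processor, value]. The keys make a pair
    # (a, i) precede memory address j exactly when a <= j.
    wires = [[(a, 0, i), a, i, None] for a, i in pairs]
    wires += [[(j, 1, 0), j, None, c] for j, c in enumerate(memory, start=1)]

    # Step 2: run the merging network, fanning out contents on the way.
    for i, j in oddeven_merge(0, p + m - 1):
        x, y = wires[i], wires[j]
        if x[1] == y[1]:                   # same address meets in a comparator
            if x[3] is None:
                x[3] = y[3]
            elif y[3] is None:
                y[3] = x[3]
        if x[0] > y[0]:                    # ordinary compare-exchange
            wires[i], wires[j] = y, x

    # Steps 3 and 4: every message returns to its super-processor.
    result = [None] * p
    for key, address, processor, value in wires:
        if processor is not None:
            result[processor - 1] = value
    return result


print(read_pulse([3, 1, 3, 3], [10, 20, 30, 40]))
```

In the example three processors read address 3 at once. Only some of them meet the memory message directly; the others obtain the content from a neighbouring pair with the same address, which is the fan-out that the merging network provides.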

IV. Correctness proof of the reading pulse simulation

To remind the reader, [(a_j,j), d_j] corresponds to a message
originated at super-processor j. This super-processor wishes to read
from address a_j. We claim that upon arriving at its corresponding
terminal comparator-processor, at Step 2, the variable d_j of this
message is assigned with c_{a_j}, the content of address a_j. Sending
the message [(a_j,j), d_j] back to super-processor j in a later step
will achieve the goal of bringing c_{a_j} to j's 'knowledge'.

Let P_{i_1}, P_{i_2}, ..., P_{i_k} (i_1 < i_2 < ... < i_k) be all the
processors that wish to read from the common memory address j
(a_{i_ℓ} = j for 1 ≤ ℓ ≤ k) in the current pulse.


Let us define recursively the set V_j that contains
comparator-processors and memory processor j (vertices) as its
elements:

1. j ∈ V_j.

2. v ∈ V_j, if for some w ∈ V_j, there is a line directed from w to
v such that the message (j,c_j) or a message [(a_{i_ℓ},i_ℓ), d_{i_ℓ}]
(1 ≤ ℓ ≤ k) was transmitted at Step 2 along this line.
In  Figure  2,  see,  for  example,  the  set  V-. 

Remark. In general merging networks it is possible (apparently due to
redundancy) that the set V_j induces a connected graph which is not a
tree. It is easy to see that in Batcher's network it induces a tree.
Theorem. Let v be a terminal comparator-processor which
corresponds to a message [(a_{i_ℓ},i_ℓ), d_{i_ℓ}] (1 ≤ ℓ ≤ k). (Namely,
this message is received at v but not transmitted further by Step 2 of
the simulation.) Then v ∈ V_j. (Remember a_{i_ℓ} = j.)

Remark. The theorem readily implies the correctness of the reading
pulse simulation. This is because every comparator-processor v in V_j
'gets' the value of c_j through a line that implied its membership in
V_j (see the recursive definition of V_j) and therefore each d_{i_ℓ}
is assigned with c_j by the time the message [(a_{i_ℓ},i_ℓ), d_{i_ℓ}]
arrives at its corresponding terminal comparator-processor.


Proof. The only fact required for the proof is that the merging
network is correct. The output of our merging includes a successive
list of k+1 elements sorted in the following order:

[(a_{i_1},i_1), d_{i_1}], [(a_{i_2},i_2), d_{i_2}], ..., [(a_{i_k},i_k), d_{i_k}], (j,c_j).

We assume that common memory addresses are limited to integers,
and so are the a_i's. (There could be a problem when a_i corresponds to
a nonexistent address.) Imagine that instead of (j,c_j) we take
(j-1/2,c_j), leaving all the other inputs as they were. In that case,
the same list of k+1 elements is placed in the same block of the output
but in a different order. The new order is:

(j-1/2,c_j), [(a_{i_1},i_1), d_{i_1}], ... .

(Remark. Although we do not need the output lines of the merging
network for the simulation, we use them here for the sake of argument.)

Claim.  (1)  Every  line  which  is  traversed  by  one  of  our  k+1 
messages  in  the  first  application  of  the  network  is  traversed  by  one  of 
the  corresponding  k+1  messages  in  the  second  application  and  vice 
versa.  (2)  Each  message  other  than  these  k+1  messages  traverses 
exactly  the  same  lines  in  both  applications. 

Proof of the claim. If a certain entering line to a
comparator-processor transmits one of the k+1 messages in each
application (not necessarily the same) and the other entering line to
this comparator-processor transmits one of the other messages in both
applications, the result of the comparison will be the same. Thus,
corresponding inputs will be sent through the same emanating lines.
The case where both inputs belong, or do not belong, to the k+1
messages in both applications is easy because no 'harm' can be done.

Let us look at one of the [(a_{i_ℓ},i_ℓ), d_{i_ℓ}] messages. We wish to
prove that its corresponding terminal comparator-processor belongs to
V_j. Let us trace this [(a_{i_ℓ},i_ℓ), d_{i_ℓ}] message in both
applications of the merging network described above. We are going to
have two paths that start at the same input. Denote by v_1 the last
comparator-processor that belongs to both paths before they split for
the first time. There is such a v_1 because the paths do not end at the
same output line. Proving that v_1 ∈ V_j would imply that the terminal
comparator-processor corresponding to our message
[(a_{i_ℓ},i_ℓ), d_{i_ℓ}] belongs to V_j: just apply the definition of
V_j to the path traversed by the message from v_1 to this terminal
comparator-processor. So, it remains to show that v_1 ∈ V_j.

Since one of v_1's entering lines inputs the message
[(a_{i_ℓ},i_ℓ), d_{i_ℓ}] in both applications, the messages input on
the other entering line could not be the same. Let v_2 be the
comparator-processor on the other side of this line. As v_2 had a
different output from one application of the network to the other, one
of its entering lines had to transmit different inputs. Let v_3 be the
comparator-processor on the other side of such a line, and so on.
Since our merging network is finite and acyclic we arrive eventually at
memory processor j. This is because it is 'responsible' for the only
input line to the merging network which inputs different data in the
two applications.

Since j belongs to V_j, so does v_1. ∎

V. Simulation of F&* and writing pulses


Consider a general merging network, where $V_j$ does not necessarily
induce a tree. (Recall the remark that precedes the theorem of the
previous section.) Following the ideas of the fan-out method of Step 2
in the reading pulse simulation, we can very easily identify the
comparator-processors whose in-degree in the graph induced by $V_j$ is 2.
Eliminating one of the entering edges in each of them reduces the
induced graph to a tree. Denote this tree by $V'_j$. The Appendix shows
how to utilize a binary-tree synchronous distributed machine for the
computation of an F&* instruction. The F&* instructions that relate to
the same shared memory location are entered into the leaves of the
tree. Then we climb up the tree from the leaves to the root. Later we
return to the leaves. The main idea here is that the $V'_j$ trees can
serve as these trees. Let us look at Figure 1 and try to imagine the
direction in which the data moves in the SDC machine. There will be
six steps instead of the four of the reading pulse simulation. In the
first and second steps the information moves from left to right in the
sorting and merging networks, respectively. The $V'_j$ trees, which are
subgraphs of the merging network, lie on their side, the roots on the
left of the leaves. Climbing up the trees corresponds to moving data
from right to left in the merging network. This is done in Step 3. In
Step 4 we climb down the tree and, therefore, move from left to right
in the merging network. In Steps 5 and 6 the data are moved from right
to left in the merging and sorting networks, respectively, thereby
transmitting the required 'partial sums' to the processors. (The sums
were transmitted to the appropriate common memory locations in Step 3.)

No  new  ideas  are  required  for  simulation  of  the  writing  pulse.   It 
is  left  as  an  exercise  for  the  interested  reader. 

Complexity. Using Batcher's merging and sorting networks, the sorting
network requires $O((\log p)^2)$ layers, each of which has $O(p)$
comparator-processors. The merging network has $O(\log(p + m))$ layers,
with $O(m + p)$ comparator-processors in each, plus $m$ memory processors.
So, to simulate one pulse of an algorithm of the F&* PRAM on the SDC
machine, we need $O((\log p)^2 + \log(m + p))$ time.

In the next section we show how this can be improved.

VI.  Efficient simulation

Let us have a second look at a pulse of an algorithm of the
F&* PRAM. Say that our parallel computation model employs $p$ processors
and uses a common memory of size $m$, as before. So far, we have presented
a simulation of this algorithm in an SDC machine which uses $p$
super-processors, $m$ memory-processors and $f(p,m)$ comparator-processors,
where $f$ is a function of $p$ and $m$ that depends on the sorting and
merging networks that we utilize. Let $\ell(p,m)$ be the length of the
longest directed path starting at a super-processor or at a
memory-processor in the combined (both merging and sorting) network of
the implementation.

In this section the number of super-processors is denoted by $s$. It
is smaller than $p$, the number of processors of the parallel
computation model, while $m$, the number of memory-processors, is the
same as the number of common memory addresses, as before.

Super-processor $S_i$, $1 \le i \le s$, will be 'responsible', throughout
this section, for simulating the behavior of processors
$P_i, P_{i+s}, P_{i+2s}, \ldots$ in each pulse of the algorithm which is
formulated in the parallel computation model. Therefore, we sometimes
call the processors of the F&* PRAM design space virtual processors.
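
As a minimal sketch of this assignment, the following Python function lists the virtual processors handled by one super-processor. The function name `virtual_processors` and the 1-based indexing are illustrative assumptions, not notation from the paper.

```python
def virtual_processors(i, s, p):
    """Indices k of the virtual processors P_k simulated by super-processor S_i.

    Assumes 1-based indices with 1 <= i <= s <= p (an illustrative convention):
    S_i handles P_i, P_{i+s}, P_{i+2s}, ... up to P_p.
    """
    return list(range(i, p + 1, s))


if __name__ == "__main__":
    s, p = 4, 10
    for i in range(1, s + 1):
        print(f"S_{i} simulates virtual processors", virtual_processors(i, s, p))
```

Under these assumptions every virtual processor is handled by exactly one super-processor, and no super-processor handles more than ceil(p/s) of them.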

We do it by pipelining. In the first time unit of the pulse
simulation, we start a process similar to the simulation of an
algorithm that is given for $P_1, \ldots, P_s$ only and $m$ common memory
addresses by $S_1, \ldots, S_s$. We call this process the first cycle of the
pulse simulation. After a constant number of time units we start a
second cycle, similar to the pulse simulation of an algorithm that is
given for $P_{s+1}, \ldots, P_{2s}$ and $m$ common memory addresses by
$S_1, \ldots, S_s$, and so on. Simulation of reading pulses does not
require more than that. Simulation of an F&* pulse involves the
following additional observations.

Let $y$ denote both the common memory location that a certain F&*
instruction relates to, and the memory processor that simulates this
memory address. No confusion will arise. We use the $V'_y$ trees which
are formed during each cycle in a form similar to the previous section.
Due to a small change, however, these trees compute the F&* instruction
relative to the present pulse (rather than to the present cycle).
Remember that memory-processor $y$ is the root of all these $V'_y$ trees.
Recall, from the way the previous section invokes the Appendix, that the
computation of the F&* instruction involves computation of the A
numbers (moving up the $V'_y$ tree or, equivalently, moving right to left
in the merging network) followed by computation of the B numbers
(moving down $V'_y$). Here, we start by computing the A numbers similarly.
Following this, memory-processor $y$ has its A number in the $V'_y$ of the
current cycle; this is the '*-sum' of the contents of all virtual
processors that participate in the F&* instruction and are simulated by
the cycle. (This A number corresponds to $A(a,1)$ of the Appendix.
Denote it by $A(y)$ and the corresponding B number by $B(y)$.) Let us call
the time immediately after the computation of $B(y) \leftarrow B(y) * A(y)$
by memory-processor $y$ and before the next cycle arrives at $y$ the
midpoint of the cycle. It should be obvious that at this time
memory-processor $y$ is "ready" for the cycle that follows, and the
computations for this cycle will be with respect to the pulse being
simulated. This is so since the only interaction between cycles is at
the memory processors.
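
The effect of the midpoint update can be illustrated by a small sequential sketch. It is not the distributed computation itself: the function name `simulate_pulse_at_memory`, the list-of-lists input, and the default choice of + for the * operation are assumptions made for illustration. Each inner list holds the operands of the virtual processors that address location y in one cycle, in the order the tree would serve them.

```python
from operator import add


def simulate_pulse_at_memory(c, cycles, op=add):
    """Return (values fetched in each cycle, final content of location y).

    c      : content of location y before the pulse, i.e. the initial B(y)
    cycles : one list of operands per cycle (hypothetical input format)
    op     : the associative * operation (addition assumed by default)
    """
    b_y = c
    fetched = []
    for operands in cycles:          # cycles reach y one after another
        prefix = b_y                 # the down sweep of this cycle starts from B(y)
        values = []
        for b in operands:
            values.append(prefix)    # value returned to this virtual processor
            prefix = op(prefix, b)
        fetched.append(values)
        # Midpoint of the cycle: prefix now equals B(y) * A(y), where A(y)
        # is the *-sum of this cycle's operands, so store it as the new B(y).
        b_y = prefix
    return fetched, b_y


if __name__ == "__main__":
    print(simulate_pulse_at_memory(10, [[1, 2], [3], [4, 5]]))
```

Because B(y) is updated before the next cycle arrives, the values fetched across all cycles are the same as if the whole pulse had been served by a single tree.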

The  degree  of  each  node  in  the  graph  of  communication  of  our  SDC 
machine  is  actually  bounded  by  sixteen  (physically  it  is  bounded  by 
four)  in  the  simulation  that  involves  pipelining.   This  is  because: 

1.  Each line can be used simultaneously in both directions by two
different cycles of the same pulse.

2.  Two cycles of a simulation of an F&* pulse may simultaneously
use the same (merging network) line in the same direction.

If we wish that a processor of the SDC machine be concerned with
no more than one of its lines at a time, we can do it by partitioning
the lines of the network into seventeen sets (the maximum degree of a
node plus one, by Vizing's edge coloring theorem [Be-73]) in such a way
that no two lines of the same set share a node.

So it takes $O(p/s + \ell(s,m))$ time units to simulate one pulse of the
algorithm. If $p/s \ge \ell(s,m)$, then the number of time units is simply
$O(p/s)$. So, to sum up,

Theorem. Given an algorithm with depth $O(t/p)$ for all $p \le x$ in an
F&* PRAM with $p$ processors and $m$ common memory addresses, where $t$, $x$
and $m$ are some numbers, we can simulate it in an SDC machine with $s$
super-processors, $m$ memory processors and $f(s,m)$ comparator-processors
in depth $O(t/s)$ for $s \le x / \ell(s,m)$.

One possible configuration (mentioned above) results in
$O(s (\log s)^2 + m \log m)$ degenerate processors and
$\ell(s,m) = (\log s)^2 + \log m$, where $m$ is the size of the common
memory. A second possible configuration, which uses the recently
suggested sorting network of [AKS-83], enables us to replace the
$(\log s)^2$ term by $\log s$ in the last two formulae. However, the
constants involved in the second configuration are substantially larger.
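
To get a feeling for the trade-off in the first configuration, the sketch below evaluates the bounds with all hidden constants set to 1. That is an assumption made only for illustration, so the numbers indicate orders of growth, not actual running times. The function names are hypothetical.

```python
import math


def path_length(s, m):
    """l(s, m) = (log s)^2 + log m, with constant factors dropped."""
    return math.log2(s) ** 2 + math.log2(m)


def pulse_time(p, s, m):
    """Time units per pulse, p/s + l(s, m), with constant factors dropped."""
    return p / s + path_length(s, m)


def max_super_processors(x, m):
    """Largest s with s * l(s, m) <= x, i.e. s <= x / l(s, m).

    Assumes m >= 2 and x >= log m, so that s = 1 already qualifies.
    The product s * l(s, m) grows with s, so a linear scan suffices.
    """
    s = 1
    while (s + 1) * path_length(s + 1, m) <= x:
        s += 1
    return s


if __name__ == "__main__":
    p = m = 1024
    for s in (4, 16, 64):
        print(s, pulse_time(p, s, m))
    print("largest useful s for x = 1024:", max_super_processors(1024, m))
```

Below the threshold returned by `max_super_processors` the p/s term dominates, which is the range in which the theorem gives depth O(t/s).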

VII.      Conclusion 

A consequence of the use of pipelining in our efficient simulation,
where each super-processor simulates the behavior of several processors,
is the incentive to design optimal (or close to optimal) algorithms for
a much wider range of processors (virtual processors) than actually
available, and to use the extensive 'library' of such known algorithms.

The result of this paper follows [GGKMRS-83], [SV-82c], [Vi-81a]
and [Vi-83b] in supporting a more permissive design space as long as
the fine for realizing it in feasible implementation spaces does not
increase. For example, consider a CREW (concurrent-read,
exclusive-write) PRAM (similar to [FW-78]). It is a CRCW PRAM that
does not allow simultaneous access to the same memory location by
several processors for write purposes. One could expect that
implementing algorithms given in a CREW PRAM into a synchronous
distributed model, where each node has a small degree, will require
less time or fewer processors than doing the same for algorithms given
in a CRCW PRAM, and that this 'ratio' will be even smaller for
implementing the F&* PRAM. By 'less' we mean less by more than a
constant factor. As far as I know, every efficient solution for the
first problem has a corresponding solution for the second problem that
preserves both the time and the number of processors, thus supporting
the permissive models of computation. Moreover, in [Vi-83b] it is
proven that this situation holds for any "reasonable" simulation of the
CREW PRAM into the SDC.


Consider the case where the $m$ common memory addresses are
partitioned among $N$ memory-processors, where $N < m$ and only one address
of each memory-processor may be accessed at a time. Let us overview an
adaptation of the PDDI to this case, which seems to approximate current
technological limitations better (according to [Ku-77] and
[GGKMRS-83]). In addition to the sorting and merging networks, our
revised PDDI includes a balanced binary tree having $p$ (the
number of super-processors) leaves. Each leaf is connected to an
output line of the sorting network (which is also an input line of the
merging network). Inputs for the sorting network are triples of the
form $(m_i, a_i, i)$, where $m_i$ is a memory-processor number and $a_i$ is
an address in its local memory ($i$ is a super-processor number as
before). The binary tree is used to queue up requests for distinct
addresses of the same memory-processor by attaching serial numbers to
them. The simple details are left to the reader. These requests are
pipelined into the merging network accordingly. Note that there is an
unavoidable bottleneck due to the fact that only one address of each
memory-processor can be referenced at a time. We do not elaborate on
further details regarding this extension of our solution since no new
ideas are involved. Finally, we would like to mention that Mehlhorn
and Vishkin [MV-82] recently suggested a few strategies to control
these bottlenecks.
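
The text leaves the queueing details to the reader. One possible reading is sketched below as a sequential scan over the sorted triples; it stands in for the binary tree and is not the author's construction. The function name `attach_serial_numbers` and the rule that requests for the same address share a serial number are assumptions.

```python
def attach_serial_numbers(sorted_requests):
    """Attach a serial number to each (memory_processor, address, i) triple.

    sorted_requests is assumed to be the output of the sorting network,
    i.e. sorted by memory-processor number and then by address.
    Requests for the same address of a memory-processor share a serial
    number; each new distinct address of that memory-processor gets the
    next one (an assumed queueing rule).
    """
    result = []
    prev_mp, prev_addr, serial = None, None, 0
    for mp, addr, i in sorted_requests:
        if mp != prev_mp:
            serial = 0          # first address requested at this memory-processor
        elif addr != prev_addr:
            serial += 1         # another distinct address at the same memory-processor
        result.append((mp, addr, i, serial))
        prev_mp, prev_addr = mp, addr
    return result


if __name__ == "__main__":
    requests = sorted([(1, 9, 2), (2, 4, 1), (1, 5, 7), (1, 5, 3)])
    for row in attach_serial_numbers(requests):
        print(row)
```

Under this reading, the largest serial number attached at a memory-processor measures how long its queue is, which is the bottleneck noted above.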
Appendix
The F&* implementation tree

The following simple binary tree synchronous distributed machine
provides for the implementation of the F&* PRAM. It is similar to the
Fetch-and-Add implementation of [GGKMRS-83]. Let $n$ be a positive
integer. For simplicity we assume that $n$ is a power of 2. Let
$a = \log n$ (all logarithms in this paper are to the base 2). Let $T$ be a
complete binary tree with $n$ leaves. A processor is associated with
every node in the tree. It is represented by a combination $[h,j]$, $h$
being its height in the tree and $j$ its serial number among the other
nodes at the same height. See Figure 3. Assume $i_1, i_2, \ldots, i_k$ are
$k$ numbers where $1 \le i_1 < i_2 < \cdots < i_k \le n$ and $k$ is some
integer. There are $k$ numbers $b_{i_1}, b_{i_2}, \ldots, b_{i_k}$ which are
associated with leaves $[0,i_1], [0,i_2], \ldots, [0,i_k]$, and a number $c$
associated with the root of the tree. As before, * is any associative
and commutative binary operation.

A neutral element for the * operation is denoted by 0; for
instance, the neutral element for the + operation is the number zero.
We use 0 to denote the number zero as well. No confusion will arise.

Every node in the tree is associated with two numbers, $A(h,j)$ and
$B(h,j)$. The A numbers satisfy initially

$$A(h,j) = \begin{cases} b_{i_\ell} & \text{if } h = 0 \text{ and } j = i_\ell \text{ for some } \ell,\ 1 \le \ell \le k, \\ 0 & \text{otherwise.} \end{cases}$$

The B number of the root satisfies initially $B(a,1) = c$. Our goal is
that the B numbers will satisfy eventually

$$B(0,i_j) = c * A(0,i_1) * A(0,i_2) * \cdots * A(0,i_{j-1}) = c * b_{i_1} * b_{i_2} * \cdots * b_{i_{j-1}} \quad \text{for } 1 \le j \le k,$$

and

$$B(a,1) = c * b_{i_1} * b_{i_2} * \cdots * b_{i_k}.$$

The following (distributed) computation is performed "up the tree"
(from the leaves to the root):

$$A(h,j) \leftarrow A(h-1,2j-1) * A(h-1,2j), \quad \text{for } 0 < h \le a.$$

Immediately after this computation reaches the root we perform the
following computation down the tree:

$$B(h,j) \leftarrow \begin{cases} B(h+1, \tfrac{j}{2}) * A(h, j-1) & \text{if } j \text{ is even,} \\ B(h+1, \tfrac{j+1}{2}) & \text{else.} \end{cases}$$

Right after the computation leaves the root node, its processor performs
$B(a,1) \leftarrow B(a,1) * A(a,1)$.

Note that the computation time is proportional to the height of
the tree, i.e., $O(\log n)$.
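
The up and down computations can be checked with a short sequential sketch. It evaluates the same recurrences level by level instead of running one processor per node. The function name `fetch_and_op_tree`, the dictionary input and the default choice of + for * are assumptions made for illustration.

```python
from operator import add


def fetch_and_op_tree(n, requests, c, op=add, neutral=0):
    """Evaluate the A and B numbers on a complete binary tree with n leaves.

    requests : maps an active leaf index i_l (1-based) to its number b_{i_l}
    c        : initial B number of the root
    Returns (B numbers of the active leaves, final B number of the root).
    """
    a = n.bit_length() - 1
    assert n >= 1 and 2 ** a == n, "n is assumed to be a power of 2"
    width = [n >> h for h in range(a + 1)]      # number of nodes at height h

    # A[h][j] for 1 <= j <= width[h]; index 0 of each row is unused.
    A = [[neutral] * (width[h] + 1) for h in range(a + 1)]
    for leaf, b in requests.items():
        A[0][leaf] = b

    # Up the tree: A(h,j) <- A(h-1,2j-1) * A(h-1,2j).
    for h in range(1, a + 1):
        for j in range(1, width[h] + 1):
            A[h][j] = op(A[h - 1][2 * j - 1], A[h - 1][2 * j])

    # Down the tree, starting from B(a,1) = c.
    B = [[neutral] * (width[h] + 1) for h in range(a + 1)]
    B[a][1] = c
    for h in range(a - 1, -1, -1):
        for j in range(1, width[h] + 1):
            if j % 2 == 0:
                B[h][j] = op(B[h + 1][j // 2], A[h][j - 1])
            else:
                B[h][j] = B[h + 1][(j + 1) // 2]

    new_root = op(B[a][1], A[a][1])             # B(a,1) <- B(a,1) * A(a,1)
    return {leaf: B[0][leaf] for leaf in sorted(requests)}, new_root


if __name__ == "__main__":
    print(fetch_and_op_tree(8, {2: 5, 3: 1, 7: 10}, 100))
```

With n = 8, c = 100 and numbers 5, 1 and 10 at leaves 2, 3 and 7, the leaves receive 100, 105 and 106 and the root ends with 116, matching the goal stated for the B numbers.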

The $V'_j$ trees which are obtained by the simulation are binary but
not complete. Adapting the computation described above to these trees
is very simple. Observe that node-processors which are not on a
shortest path from the root to an active leaf (a leaf of the form
$[0,i_\ell]$ for some $\ell$, $1 \le \ell \le k$) do not participate in the
computation. A node-processor of the $V'_j$ tree that has only one son
treats it as a left son.
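
A hedged sketch of this adaptation follows. The tree encoding is an assumption chosen for brevity: an active leaf is a plain number, and an internal node is a Python list with one or two sons. The names `up_sweep`, `down_sweep` and `fetch_and_op_on_tree` are hypothetical.

```python
from operator import add


def up_sweep(tree, op):
    """Return (A number, annotated sons) for a node; sons is None for a leaf."""
    if not isinstance(tree, list):
        return (tree, None)
    sons = [up_sweep(t, op) for t in tree]
    a_number = sons[0][0]
    for son in sons[1:]:
        a_number = op(a_number, son[0])
    return (a_number, sons)


def down_sweep(node, b_number, op, out):
    """Pass B numbers down; a single son is treated as a left son."""
    _, sons = node
    if sons is None:
        out.append(b_number)            # B number delivered to an active leaf
        return
    down_sweep(sons[0], b_number, op, out)
    if len(sons) == 2:
        # The right son receives the B number of its father * A of the left son.
        down_sweep(sons[1], op(b_number, sons[0][0]), op, out)


def fetch_and_op_on_tree(tree, c, op=add):
    """Return (B numbers of the leaves, left to right; final root value)."""
    root = up_sweep(tree, op)
    out = []
    down_sweep(root, c, op, out)
    return out, op(c, root[0])


if __name__ == "__main__":
    # Leaves 5 and 1 share a father; leaf 10 hangs below two one-son nodes.
    print(fetch_and_op_on_tree([[5, 1], [[10]]], 100))
```

The one-son nodes simply forward the B number of their father, so the result agrees with the complete-tree computation on the same active leaves.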
Figure 1.  The network of the implementation.
In the line labels, $(a_1,1)$ is transmitted on the line from left to right while
$[(a_1,1), d_1]$ is transmitted from right to left.


Figure 2.  Example of Step 2.

$p = m = 8$
$a_1 = a_2 = 3$
$a_3 = a_4 = a_5 = a_6 = 7$
$a_7 = a_8 = 8$

Dotted boxes represent comparator-processors in which the content of some common memory
address is copied in Step 2. Dashed boxes represent comparator-processors which are members
of $V_j$ in the proof of the theorem of Section IV.